The line above allows the code to run without its output appearing in the rendered document. The practical reason for this is to keep the document clean (it suppresses the chunk's output but still runs it).
We are working with a dataset from a fictional coffee shop. The data is contained in a CSV file (already uploaded in the Files pane at right). Let’s load the data and take a look at what we have.
Linear Regression
Linear regression is a data analysis technique that predicts an unknown value of one variable by using a related, known value of another variable.
shop = read_csv("coffee_shop.csv")
Rows: 50 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): Week, Sales, Rating
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This is a straightforward dataset with three numeric variables: Week, Sales, and Rating.
summary(shop)
Week Sales Rating
Min. : 1.00 Min. : 274.8 Min. :7.700
1st Qu.:13.25 1st Qu.: 818.8 1st Qu.:8.725
Median :25.50 Median :1457.6 Median :9.000
Mean :25.50 Mean :1463.7 Mean :9.010
3rd Qu.:37.75 3rd Qu.:2045.8 3rd Qu.:9.300
Max. :50.00 Max. :2667.2 Max. :9.800
Our typical data cleaning tasks of type conversion and dealing with missing data are not necessary as the variables are already in the correct type and there is no missing data (how do we know this?).
Let’s take a look at linear regression. The basic idea is that we want to model the linear relationship between two variables. Later, we’ll look at situations where there are more than two variables. We’ll define one of our variables as the response variable. This is the variable that we are trying to predict. In reality, this is typically defined for us before we start to attack a problem. Some or all of the other variables in the dataset may then be predictor variables. We assume that there is a linear relationship between predictor variables and the response variable.
Before we start, the following must be true (for linear regression):
The response variable MUST be numeric (Y: always numeric)
The predictor variables can be numeric or categorical
We assume that the relationship between our predictor variable and the response variable can be modeled with the following relationship (for a single predictor variable model):
\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1\)
In this model, \(y\) is the response, \(x\) is the predictor, \(\hat{\beta}_0\) is the y-intercept (the point where the regression line crosses the y axis), and \(\hat{\beta}_1\) is the slope of the regression line. Recall from algebra that knowing the slope and y-intercept is sufficient to draw a line.
Best line = best fit
x = independent variable = predictor
y = dependent variable = response
Linear Regression Line (Source: AnalystPrep)
There are several assumptions that must be true in order for us to successfully build a linear regression model:
Linear relationship between \(x\) and \(y\)
LOOKS LIKE A LINE :))
Regression model errors are independent. Error is the difference between a predicted value and the actual value. The direction and size of the “next” error in a model should not be predictable.
ERRORS NEED TO BE INDEPENDENT
The variance of errors is constant. We’ll explore this further later.
DOES IT LOOK LIKE A FUNNEL? THE SPREAD OF THE ERRORS SHOULD STAY CONSTANT; GOOD PREDICTIONS HAVE ERRORS CLOSER TO ZERO
Errors should be Normally distributed. We’ll explore this further later.
“NORMALLY” DISTRIBUTED … BELL CURVE?
Our workflow for linear regression looks like this:
Import the data and familiarize ourselves with the dataset (be sure that we can identify which variable in the dataset is the response variable and which variable(s) will serve as predictors).
Clean and prepare the data (e.g., type conversions, missing data, etc.).
Visualize the relationship between each potential predictor variable and the response variable.
Calculate correlation between each NUMERIC predictor and the response variable.
Build and evaluate linear regression models.
Select a “champion” model or set of models.
In our coffee shop dataset, which variable is the response?
Sales is the response variable; it has a strong, positive, and LINEAR relationship with Week.
It’s pretty clear that Week has a STRONG, positive, linear relationship with Sales. Rating has a much weaker relationship (slightly negative).
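Statements like this come from plotting each predictor against the response and computing correlations. A minimal sketch, assuming shop has been loaded as above:

```r
library(ggplot2)

# Scatterplot of each predictor vs. the response
ggplot(shop, aes(x = Week, y = Sales)) +
  geom_point() +
  theme_bw()

ggplot(shop, aes(x = Rating, y = Sales)) +
  geom_point() +
  theme_bw()

# Correlation of each numeric predictor with the response
cor(shop$Week, shop$Sales)    # strong and positive (near 1)
cor(shop$Rating, shop$Sales)  # weak, slightly negative
```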
BIG NOTE: Correlation measures the strength of LINEAR relationship between two variables. It DOES NOT measure that strength well if the relationship is NON-LINEAR!
Correlation Examples
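A quick illustration of that warning (a self-contained sketch, not from the original notes): a perfect but non-linear relationship can still have a correlation near zero.

```r
x <- seq(-3, 3, by = 0.1)
y <- x^2          # y is completely determined by x, but not linearly
cor(x, y)         # approximately 0: correlation misses the relationship
```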
We can now build our linear regression models. Let’s build a model with Week to predict Sales.
model1 = lm(Sales ~ Week, shop)  # lm = linear model
# Sales is our response variable
# Week is our predictor variable
# shop is the name of our dataset
summary(model1)
Call:
lm(formula = Sales ~ Week, data = shop)
Residuals:
Min 1Q Median 3Q Max
-91.190 -33.055 -1.365 29.513 107.730
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 203.8136 13.3103 15.31 <2e-16 ***
Week 49.4083 0.4543 108.76 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 46.35 on 48 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared: 0.9959
F-statistic: 1.183e+04 on 1 and 48 DF, p-value: < 2.2e-16
Regression Line
Sales = 203.8 + 49.4*Week
Pr(>|t|) for Week is <2e-16, which means the Week variable is a significant predictor of Sales 😊
The positive coefficient (49.4) means that as Week increases, Sales increases.
Multiple R-squared: 0.996 indicates model quality (closer to 1 is better).
Let’s talk our way through the output. It looks complicated, but there are only a few items that we really need to pay attention to.
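One thing we can do with the fitted model right away is make a prediction, either by hand from the regression line or with predict(). A sketch assuming model1 from above:

```r
# By hand, from Sales = 203.8 + 49.4 * Week:
203.8136 + 49.4083 * 10   # predicted Sales for Week 10 (about 698)

# The same prediction using the fitted model
predict(model1, newdata = data.frame(Week = 10))
```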
Next, we look at our regression diagnostics. The first plot puts the predicted values on the x axis and the residuals on the y. This plot should look like 1) a random scattering of points with no discernible patterns and 2) the spread (variance) of residuals should not change as we work our way from left to right on the plot. Does this happen in this plot?
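One way to draw this residuals-vs-fitted plot, assuming ggplot2 is loaded (a sketch; the original plot may have been produced differently):

```r
ggplot(model1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +  # residuals should scatter randomly around zero
  theme_bw()
```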
Here’s an example of patterned residuals (Source: ISLR textbook):
Patterned Residuals
Here’s an example of non-constant variance of residuals (Source: ISLR textbook):
Non-Constant Variance of Residuals
Next is a plot of the distribution of residuals. We can use a histogram for this. Ideally, this histogram should look like a Normal distribution.
week is a significant predictor of sales
positive relationship
the R-squared value is really close to 1
ggplot(model1, aes(x = .resid)) +
  geom_histogram(binwidth = 20) +  # experiment with binwidth to get a "good" histogram
  theme_bw()
Let’s repeat this analysis, but for Rating.
model2 = lm(Sales ~ Rating, shop)
summary(model2)
Call:
lm(formula = Sales ~ Rating, data = shop)
Residuals:
Min 1Q Median 3Q Max
-1209.94 -691.92 19.87 570.85 1205.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3150.1 2110.6 1.493 0.142
Rating -187.2 234.0 -0.800 0.428
Residual standard error: 724.4 on 48 degrees of freedom
Multiple R-squared: 0.01316, Adjusted R-squared: -0.007403
F-statistic: 0.6399 on 1 and 48 DF, p-value: 0.4277
How do we feel about this model? How does it differ from the previous model?
* R-squared is awful: near 0 (0.013)
* Rating is not a significant predictor of Sales (p-value = 0.428; its coefficient estimate is -187.2)
We can extend our regression model to have multiple predictors. We’ll go into this in more detail in the future, but here’s the basic idea. The regression equation expands to incorporate more than one predictor variable:
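For two predictors, the equation becomes:

\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2\)

The output that follows was produced by a call like this (the model name here is a guess; the formula matches the Call line in the output):

```r
model3 = lm(Sales ~ Week + Rating, shop)
summary(model3)
```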
Call:
lm(formula = Sales ~ Week + Rating, data = shop)
Residuals:
Min 1Q Median 3Q Max
-84.911 -32.618 -2.136 24.873 105.671
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 116.1588 138.8182 0.837 0.407
Week 49.4436 0.4605 107.370 <2e-16 ***
Rating 9.6288 15.1779 0.634 0.529
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 46.65 on 47 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared: 0.9958
F-statistic: 5841 on 2 and 47 DF, p-value: < 2.2e-16
What’s the interpretation of this model?
NOTE: For a “fair” comparison of models with differing numbers of variables, use the adjusted R-squared value. This value adjusts for the number of variables in the model.
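For reference, with \(n\) observations and \(p\) predictors the adjustment is:

\(R^2_{adj} = 1 - (1 - R^2)\dfrac{n - 1}{n - p - 1}\)

Plugging in the two-predictor model’s values (\(R^2 = 0.996\), \(n = 50\), \(p = 2\)) gives roughly 0.9958, matching the output above.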